ABSTRACT

A method of hierarchical digital video summarization and browsing includes inputting a digital video signal for a digital video sequence and generating a hierarchical summary based on keyframes of the video sequence. Additional steps may include computing histograms for the digital video sequence; detecting shot boundaries within the digital video sequence; determining the number of keyframes to be allocated within each shot; locating the actual position of each keyframe within each shot; identifying keyframe locations by the largest consecutive difference criteria; pruning keyframes for any shot without meaningful action; extracting keyframes efficiently in the case of compressed video; and browsing the shots using the hierarchical keyframe summary.

30 Claims, 6 Drawing Sheets
METHOD FOR HIERARCHICAL SUMMARIZATION AND BROWSING OF DIGITAL VIDEO

RELATED APPLICATION

"Method for Detecting Transition in Sampled Video Sequences," of Krishna Ratakonda, Ser. No. 09/004,058, filed Jan. 7, 1998.

FIELD OF THE INVENTION

This invention relates to determining representation of a digital video sequence by a set of still images in a hierarchical summary for applications such as (i) visual identification of video content; (ii) video indexing; (iii) video browsing; and (iv) video editing. The digital video sequence may be Moving Pictures Experts Group (MPEG) compressed and the representation may be determined with minimal decoding of the compressed bitstream.

BACKGROUND OF THE INVENTION

Compact representation of video is essential to many information query and retrieval applications. Examples of such applications range from multi-media database access to skimming (fast forwarding) through a video clip. Most previous approaches have mainly concentrated on splitting a given video segment into "shots." Each shot is represented by a keyframe which summarizes the shot. Thus one may view these representative frames instead of browsing through the entire video. Shot detection may be achieved with high accuracy (>90%) and few misses (<5%). Histogram based approaches are among the most successful shot detection strategies as well as being the least computationally demanding. A comparison between various shot detection strategies may be found in the literature. Many of these schemes also take into account some special situations [illegible].

Known techniques generally concentrate on detecting shot boundaries or scene changes and using a collection made up of a single frame from each shot as keyframes [illegible]. Assigning more than one keyframe to each shot provides better summaries representing the video content. Such known summarization methods, however, provide a single layer summary without any flexibility.

Other known techniques make use of color histograms and describe methods for forming histograms from MPEG bitstreams (e.g., histograms of DC coefficients of 8x8 block DCT). Although this is relatively straightforward for I (intra-coded) frames, there is more than one way of recovering DC (zero frequency) coefficients of a P (predicted) frame or B (bi-directionally predicted) frame with minimal decoding of its reference picture.

Known references that are concerned with discrete cosine transformation (DCT)-compressed video, however, do not address at all the practical aspects of a working system. For example, after they are identified, keyframes have to be decoded for visual presentation. None of the known references specify an efficient mechanism for decoding keyframes that may be positioned at arbitrary locations of the bitstream without decoding the entire video sequence.

A major limitation of the above schemes is that they treat all shots equally. In most situations it might not be sufficient to represent the entire shot by just one frame. This leads to the idea of allocating a few keyframes per each shot depending on the amount of "interesting action" in the shot. The current state of the art video browsing systems thus split a video sequence into its component shots and represent each shot by a few representative keyframes, where the representation is referred to as "the summary."

The invention improves and extends the method disclosed by L. Lagendijk, A. Hanjalic, M. Ceccarelli, M. Soletic, and E. Persoon, "Visual Search in SMASH System", Proceedings of International Conference on Image Processing, pp. 671-674, Lausanne, 1996, hereinafter "Lagendijk."

SUMMARY OF THE INVENTION

The invention is a method of hierarchical digital video summarization and browsing, and includes, in its basic form, inputting a digital video signal for a digital video sequence and generating a hierarchical summary based on keyframes of the video sequence. Additional steps may include computing histograms for the digital video sequence; detecting shot boundaries within the digital video sequence; determining the number of keyframes to be allocated within each shot; locating the actual position of each keyframe within each shot; identifying keyframe locations by the largest consecutive difference criteria; pruning keyframes for any shot without meaningful action; extracting keyframes efficiently in the case of compressed video; and browsing the shots using the hierarchical keyframe summary.

"Video summarization" refers to determining the most salient frames of a given video sequence that may be used as a representative of the video. A method of hierarchical summarization is disclosed for constructing a hierarchical summary with multiple levels, where levels vary in terms of detail (i.e., number of frames). The coarsest, or most compact, level provides the most salient frames and contains the least number of frames.

An object of the invention is to provide a method for creating a hierarchical multi-level summary wherein each level corresponds to a different level of detail.

Another object of the invention is to provide a method for improving keyframe selection.

Another object of the invention is to detect [illegible] motion content of the [illegible], specifically [illegible], to the [illegible] hierarchical frame summary.

A further object of the invention is to provide a method for creating a hierarchical, multi-level summary of an MPEG-2-compressed video where each level corresponds to a different level of detail.

Yet another object of the invention is to provide a method that is applicable to an MPEG-2 compressed video for constructing histograms and generating a hierarchical summary with minimal decoding of the bitstream.

Another object of the invention is to provide a complete efficient system for generating summaries of MPEG-2 compressed video.

Still another object of the invention is to provide an efficient way of handling histogram computation for MPEG bitstreams.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a representation of the hierarchical structure of a video summary for three levels.

FIG. 2 is a block diagram of the first embodiment of the method of the invention.

FIG. 3 is a block diagram of an automatic pan/zoom processing module of the invention.
FIG. 4 is a block diagram of a fine-level key-frame selection algorithm of the invention.

FIG. 5 is a block diagram of the hierarchical summary of the invention.

FIG. 6 is an illustration of cumulative action measures ($C(x)$), distribution of keyframes ($k_j$) and corresponding shot segments.

FIG. 7 is a block diagram of a portion of the second embodiment of the invention for use with an MPEG-2 compressed input video.

FIG. 8 is a representation of the data that may be used to decode the keyframes in the hierarchical summary.

FIG. 9 is a graph of motion compensation.

FIG. 10 is an illustration of the difference between the motion compensation algorithms used to define Case (a) and Case (b).

Because the current technology for automatic capturing of semantic saliency is not yet mature, video summarization methods rely on low-level image features, such as color histograms. Video summarization is a way of determining the most salient frames of a given video sequence that may be used as a representative of the video. It is possible that a particular frame carrying important information may not be included in a single summary containing a pre-specified total number of frames.

Referring now to FIG. 1, a hierarchical multilevel summary 20, which is generated by the hierarchical summarization method of the invention, may provide a detailed fine-level summary with a sufficiently large number of frames, so that important content information is not lost, but at the same time provide less detailed summaries at coarser levels in order not to hinder the usage of a coarse or compact summary for fast browsing and identification of the video. Hierarchical multilevel summary 20 includes a most compact summary, 22, at the coarsest level, which should suffice until more detailed information is deemed to be necessary and the finer level summaries are invoked, such as the coarse summary 24 and the finest summary 26. Although three levels of summarization are depicted in FIG. 1, it should be appreciated that the hierarchical summary of the invention may make use of any number of levels greater than one.

Summary 20 also facilitates fast browsing through a database of video sequences where browsing may be performed on the basis of the most compact summary and progressive refinement of the summary to more detailed levels may be performed at the user's request.

Hierarchical, multi-level summarization facilitates an effective way of visual interactive presentation of video summary to the user. The user may interact with the summary via a graphical user interface, for refining the summary, visualizing different levels of the summary, and playing back the video between any two keyframes of the summary at any level. Users of the method disclosed herein may specify the maximum number of keyframes in the summary and the number of levels of the hierarchy, and thus the system is controllable for limited memory and resource applications.

The method disclosed herein is applicable to both uncompressed (or decompressed) or DCT-based (discrete cosine transform-based) compressed video, such as MPEG compressed video, or other motion-compensated predictive compressed video. In the case of MPEG compressed video, summarization is performed with minimal decoding of the bitstream and with an efficient way of decoding the keyframes, resulting in reduced computational and memory capacity requirements. The examples provided herein are of MPEG-2 compressed video, but, as noted above, are applicable to any DCT-based compressed video. Those of skill in the art will understand that a reference to an MPEG video is a reference to a compressed video stream, unless otherwise noted.

In the hierarchical summarization method disclosed herein, detection of special effects, such as fades, via post-processing is supported. Segments containing such effects are not included in the summarization process in order not to adversely affect its accuracy. Provisions are also allowed in the method for detecting pan and zoom segments for most compact and expressive representation in the summary.

A video sequence may be indexed on the basis of its summary frames using techniques developed for still images. Multiple levels provide flexibility towards indexing at varying detail level.

The hierarchical approach of the invention allows the user quickly to browse through a collection of video sequences by considering their most compact summaries 22, with an option of accessing a finer summary, 24, 26, if the content of the most compact summary is indeed interesting. A user of the method of this invention has the flexibility of refining the summary at selected segments of the video sequence.

When used to summarize an MPEG video sequence, two components, referred to as "bitstream index table generator" and "decoder manager", are provided. These components are necessary to efficiently decode the keyframes in order to generate a visual summary and subsequently browse through the video without decoding it in its entirety.

The overall method of the invention is summarized, generally at 30, in FIG. 2. The method is intended to operate on a video recorder, such as a camcorder, or on a system having the capability to, at a minimum, play video sequences, and, ideally, to store large amounts of video data, which video data serves as video input 32. The mechanism which includes the method of the invention is referred to herein as a "system." Input video 32 is first processed to detect and remove frames that are involved in special effects, such as fade in or fade out, 34, because fade in/out frames will result in spurious shot boundaries and keyframes. Such frames are classified as global motion events and are subsequently excluded from further processing. The next step is histogram computation 36. Image color histograms, i.e., color distributions, constitute representative feature vectors of the video frames and are used in shot boundary detection 38 and keyframe selection. Shot boundary detection 38 is performed using a threshold method, where differences between histograms of successive frames are compared. Given a total number of keyframes (user-specified) 40, each shot is assigned a number of keyframes 42 depending on the "action" within the shot, according to well known techniques. Finest level keyframe selection 44 is performed using an improved version of the Lagendijk technique. The implementation disclosed herein includes an improved version of this technique by incorporating additional new steps, as shown in FIG. 4, to be more fully described later herein, wherein an expansion of the finest-level keyframe selection method is provided.

Referring again to FIG. 2, the automatic pan/zoom processing, 46, which results in generation of an image mosaic, 51, and a zoom summary, 52, are optional steps, and will be explained later herein. The next step is the new
method for generating keyframe hierarchy 48, i.e., a summary at coarser detail than the finest level summary. This process is also described in detail later herein. It is based on a variation on the theme of vector quantization. Once the finest and coarse level summaries are determined for a given shot, the process is repeated for other shots in the video sequence, block 50. Additional steps include browsing the hierarchical summary, block 53, and termination of the process, block 54.

Automatic Pan/Zoom Processing

The steps of automatic pan/zoom processing 46 are shown in FIG. 3, which include detecting pan and zoom events in the digital video sequence. Frames that contain global motion are detected 56. This is a pre-screening method aimed at identifying those frames that undergo global motion. These frames may be compactly represented using an image mosaic, if the global motion is due to camera pan, as detected by pan detector 58, or by zoom summary 52, i.e., the first and last frames of a zoom-in or zoom-out sequence as detected by zoom detector 60 and compiled by zoom estimator 66. Hence mosaic building 62 is only attempted for those frames that exhibit a global pan motion and which result in image mosaic 51. Frames that take part in image mosaic 51 or in a zoom-in or zoom-out sequence are excluded from the finest level summary, block 64, as the finest level summary is further processed to form the coarser, more compact, levels.

In an alternative embodiment, pan/zoom processing 46 may be performed interactively rather than automatically. The user may select from the finest-level keyframe summary, block 44, those keyframes that constitute the start and end of a pan sequence, and the system may construct image mosaic 51 in response, and present it to the user. The user may identify or tag frame numbers K and L, i.e., the two keyframes between which there is a camera pan. Mosaic builder 62 considers frames between frame number K-n and L+n in building a mosaic, where "n" is a predetermined offset. Mosaic builder 62 may be implemented according to image stitching techniques well known to those of ordinary skill in the art.

In the case of zoom, as with pan, the user may manually specify the beginning and ending frames, or an automatic zoom detection algorithm may be used, which, again, is an algorithm well known to those of ordinary skill in the art.

A form for the hierarchical summary is depicted in FIG. 5, generally at 70. The hierarchical summary is divided into hierarchical keyframe levels. The user may be first presented with the most compact (coarsest) level summary 72, i.e., the most compact summary, possibly along with image mosaic 51 and zoom summary 52. Then the user may tag a parent and see the child(ren) frames in the finer level, referred to herein as the coarse level 74. Tagging frames in the finest level 76 results in playback of the video; for instance, if the j-th keyframe is tagged at the finest level, frames between the j-th and (j+1)-st keyframes are played back. In an actual GUI implementation the children-parent relationships may be explicitly indicated during display. As used herein, "tag" or "tagging" may be accomplished by identifying a particular object on a computer monitor, as by clicking on a particular frame. The keyframes in the hierarchical summary may be spatially sub-sampled into "thumbnails" for cost effective storage and fast retrieval and display of the summary. Normal playback of a video sequence will be at the finest level; however, playback may also be done at a coarser level.

UNCOMPRESSED VIDEO INPUT

The first embodiment of the invention is referred to herein as a "pixel domain" approach to hierarchical digital video summarization and browsing, and requires an uncompressed digital video input 32.

Assume that the total number of keyframes to be used for the entire sequence is given (which is normally dictated by [illegible] requirements). Lagendijk's technique has three key steps:

1. Detect shot boundaries,

2. Determine the number of keyframes to be allocated to each shot, and

3. Find the positions of the keyframes within each shot.

The technique used in this invention is depicted generally at 80 in FIG. 4 and includes a 3-step iterative method 82.

Detection of shot boundaries, block 38, FIG. 2, is done using a histogram based approach with a dynamic threshold. It is assumed that the first n, typically n=3, frames of the sequence do not correspond to shot boundaries. The mean action measure $A_m$ and the standard deviation of action measure $A_{sd}$ are determined by computing the mean and standard deviation of the action measures, respectively, defined later herein, across the first n frames. The threshold is set to $A_m + \alpha A_{sd}$. Once a boundary is detected according to this threshold, a new threshold is determined for the next shot in this same fashion using the first n frames of this new shot. The value of parameter $\alpha$ typically is set to 10.

The action measure ($A(\cdot,\cdot)$) between two histograms ($h_1$ and $h_2$) is defined to be (the $l_1$ norm):

$$A(h_1, h_2) = \sum_{i} \left| h_1(i) - h_2(i) \right| \qquad (1)$$

The cumulative action measure ($C(\cdot)$) for a shot (s) with n frames $s_1, \ldots, s_n$ is defined to be:

$$C(x) = \sum_{i=1}^{x} A(h_{s_i}, h_{s_{i-1}}), \quad (x \le n) \qquad (2)$$

The cumulative action measure for each shot, and the sum of the cumulative action measures of each shot, is thus found. The number of keyframes allocated to a particular shot "s", block 42, is proportionate to the relative amount of cumulative action measure within that shot.

Locating the actual positions of the keyframes within the shot may be posed as an $l_1$ minimization problem. Each keyframe represents (and replaces) a contiguous set of video frames. The union of these contiguous sets of video frames is the entire shot. Since each of these contiguous sets of video frames is represented by a single keyframe, one would like to ensure that the amount of action within one contiguous set of video frames is small. The rationale behind this is that if there is too much "action" within one contiguous set of video frames, a single keyframe might not be able to represent it fully. Thus, given the total number of keyframes to be assigned to one shot (which is the same as the number of contiguous sets into which the shot is split), a minimization procedure which finds the keyframes that minimize the action within the corresponding contiguous sets of video frames is used. Given that K keyframes are to be positioned within a shot s, let the location of the keyframes be $k_j$ ($j = 1, \ldots, K$). Further, let $t_{j-1}, \ldots, t_j - 1$ be the contiguous set of video frames represented by the keyframe at $k_j$. In other words, $[t_{j-1}, t_j - 1]$ is the shot segment which is represented by the keyframe $k_j$. The following cost criterion must be minimized over all possible $t_j$ (the $k_j$ are determined by selecting the $t_j$, i.e., $k_j = (t_j + t_{j-1})/2$):
$$g(k_1, \ldots, k_K, t_1, \ldots, t_{K-1}) = \sum_{j=1}^{K} \int_{t_{j-1}}^{t_j} \left| C(x) - C(k_j) \right| \, dx \qquad (3)$$

Note that $t_0$ and $t_K$ are the first and last frames of the shot (and hence are constants). Once $k_j = (t_j + t_{j-1})/2$ is substituted in the above cost criterion, the optimum solution satisfies $2C(t_i) - C(k_i) \le C(k_{i+1})$.
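
By way of illustration only, the quantities of Eqs. (1) through (3) and the proportional keyframe allocation may be sketched in Python as follows. The function names, the zero-based frame indexing, the largest-remainder rounding used to turn proportional shares into whole keyframe counts, the even split when no action is present, and the rounded-down midpoint are assumptions made for this sketch and are not taken from the disclosure.

```python
from typing import List, Sequence


def action_measure(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Eq. (1): l1 distance between two histograms of equal length."""
    return float(sum(abs(a - b) for a, b in zip(h1, h2)))


def cumulative_action(histograms: Sequence[Sequence[float]]) -> List[float]:
    """Eq. (2), zero-based: C[x] is the action accumulated from frame 0 to frame x."""
    c = [0.0]
    for prev, cur in zip(histograms, histograms[1:]):
        c.append(c[-1] + action_measure(prev, cur))
    return c


def allocate_keyframes(shot_actions: Sequence[float], total_keyframes: int) -> List[int]:
    """Split total_keyframes across shots in proportion to cumulative action.

    Largest-remainder rounding is an assumption; the text only says
    that the allocation is proportionate.
    """
    total = sum(shot_actions)
    if total > 0:
        shares = [total_keyframes * a / total for a in shot_actions]
    else:  # no action anywhere: fall back to an even split (assumption)
        shares = [total_keyframes / len(shot_actions)] * len(shot_actions)
    counts = [int(s) for s in shares]
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[: total_keyframes - sum(counts)]:
        counts[i] += 1
    return counts


def staircase_cost(c: Sequence[float], breakpoints: Sequence[int]) -> float:
    """Discrete form of Eq. (3) with each keyframe at the middle of its segment."""
    cost = 0.0
    for t_prev, t_next in zip(breakpoints, breakpoints[1:]):
        k = (t_prev + t_next) // 2  # k_j = (t_j + t_{j-1}) / 2, rounded down (assumption)
        cost += sum(abs(c[x] - c[k]) for x in range(t_prev, t_next))
    return cost


# Toy shot of four frames with one abrupt change between frames 1 and 2.
hists = [[4, 0], [4, 0], [0, 4], [0, 4]]
c = cumulative_action(hists)
print(c)
print(staircase_cost(c, [0, 2, 4]), staircase_cost(c, [0, 4]))
print(allocate_keyframes([8.0, 24.0], 4))
```

In this toy shot, placing a breakpoint at the abrupt change (breakpoints 0, 2, 4) gives a cost of zero, whereas representing the whole shot by a single segment (breakpoints 0, 4) gives a cost of 16, which is the behavior the staircase approximation is meant to capture.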



In order to carry out the minimization, the following steps are performed in an iterative way, which differs from that of the Lagendijk technique.

1. Set $k_1 = 1$ (assume that $t_0 = 0$ and the second frame is chosen as a candidate for being the first keyframe).

2. For $i = 1$ through $K-1$:
Define $k_{i+1}$ to be the first video frame (i.e., the video frame with the smallest subscript, n, that is greater than $t_i$) for which $2C(t_i) - C(k_i) \le C(k_{i+1})$ holds.

3. For $i = K$, compute $t'_K = 2k_K - t_{K-1}$. If $t_K > 2k_K - t_{K-1} = t'_K$, increment $k_1$ by 1 and go to Step 2. Otherwise, keep the results of the previous iteration, add an offset to all $k_i$'s so that $t_K = t'_K$, and stop.

The minimization may be carried out in a finite number of steps, as depicted in FIG. 6. Cumulative error is a non-decreasing function within a shot. Thus the above minimization procedure is aimed at finding those keyframes, $k_j$, which give the best stair case approximation (best in the $l_1$ sense) to the cumulative error curve 90. This results in a distribution of keyframes $k_j$ which varies adaptively to the amount of "action" in the shot. The area to be minimized, as expressed by the integral in Eq. 3, is depicted at 92.

The meaning of the third step above is as follows. The last keyframe of the shot should be as close as possible to the midpoint between $t_{K-1}$ and $t_K$. Increment $k_1$ and repeat steps 2 and 3 until this midpoint is exceeded for the first time and then take the results of the previous iteration and offset them such that the last keyframe coincides with the midpoint, i.e., $t_K = 2k_K - t_{K-1}$, and the $t'_K$ determined by the 3-step iterative method coincides with $t_K$.

Another novelty introduced to the previous algorithm relates to cases where one may overshoot the shot boundary even with $k_1 = 1$ due to a sufficiently large number of keyframes assigned to this particular shot. In this case, a simple scheme is used to distribute the keyframes in such a way that they are equispaced. In the simple scheme, if a shot has n frames and K frames are to be allocated, every (n/K)th frame is selected as a keyframe.

An Improvement in Keyframe Selection

In Lagendijk's technique, the keyframe for a shot segment $[t_{j-1}, t_j - 1]$, given $t_{j-1}$ and $t_j - 1$, is always located at $k_j = (t_j + t_{j-1})/2$. In other words, the keyframe is always selected to be in the middle of the segment as representative of the frames in the segment. However, going back to the definition of cumulative error, the cumulative error is dependent only on the absolute change between successive frames. Thus, a keyframe in the middle of a segment might not be representative of the actual change between two frames that are separated by more than one frame. Consider a video sequence in which a reporter is talking. Assume that there are two frames which are, for example, 10 frames apart, and that both frames show the reporter with an open mouth. Consequently, the two frames appear to represent very little change, or "action." However, the cumulative change between the two frames might be large, since the cumulative change represents the sum of the absolute changes between successive frames. It is possible that these two frames may be among those selected as keyframes if the "middle of the segment" rule is applied. Thus blindly choosing the "middle of the segment" frame as the keyframe might result in erroneous selection of keyframes.

In this embodiment, the resulting set of breakpoints within each shot, $\{t_0, t_1, \ldots, t_K\}$, obtained by the 3-step iterative method is considered. That frame in the segment $(t_{j-1}, t_j - 1)$ which is most different (in terms of the action measure) from the previous keyframe ($k_{j-1}$) is selected as the keyframe located at $k_j$. This strategy takes the largest difference from the previous keyframe, and is referred to herein as the "largest consecutive difference" criteria, block 84. The first keyframe, ($k_1$), is taken as the one determined by the 3-step iterative method. This method ensures that the successive keyframes are sufficiently different from each other, thus reducing redundancy as much as possible.

Reducing the Number of Keyframes in Shot Segments without Meaningful Action

Because Lagendijk's technique is entirely based on cumulative error, as explained above, it might report large errors between two frames which are, in fact, very close together. Although the technique introduced above is good for choosing the most interesting frame in a given shot segment, it does not resolve the situation where the entire shot segment is "uninteresting" from a standpoint of action within the shot segment. For instance, there may be an accumulation of error due to slight camera movement which does not result in much meaningful change between successive keyframes.

In order to ignore shots without any meaningful action, the shots are identified and keyframes for those shots are pruned, block 86, which leaves the finest level of keyframes, block 44. This is done by determining and analyzing the mean and standard deviation of the action measure between successive video sequence frames which lie between two given keyframes. If there is enough "meaningful action" between two keyframes, then the action measure between successive frames in the original video sequence is significant, i.e., the keyframe is identified according to the largest consecutive difference criteria, block 84.

Thus if $A_m$ is the mean action measure between keyframes $k_i$ and $k_{i-1}$ and $A_{sd}$ is the standard deviation of the action measure:

[equation illegible] (4)

if the content between the two keyframes is interesting, where s is the number of video sequence frames between the two keyframes $k_i$ and $k_{i-1}$. If the shot segment is uninteresting in the above sense, that particular keyframe is deleted and the shot segment is merged with the next shot segment.

The parameter $\beta$ in the above expression is a constant. If $\beta$ is less than 1, only keyframes with large differences will survive, which may result in excessive pruning. The value of $\beta$ is chosen to be 2.0 for the simulations reported herein. The quantity $(s/\beta)$ increases if the number of keyframes allocated to the shot is small because the distance between keyframes, and hence the number of frames between keyframes, s, increases when the number of keyframes allocated to the shot is small. The maximum value that $(s/\beta)$ may achieve is set to $\alpha$, where $\alpha$ is the factor used in defining the threshold for shot boundary detection, in order to limit the amount of pruning of keyframes.

Further experimentation revealed that the linear thresholding scheme might result in uneven keyframe allocation for some choices of total number of keyframes. In order to alleviate this problem, a limit, MAXERASE=0.3, is set on
the maximum percentage of the total number of keyframes which may be erased by the above pruning method. In the limiting case, the most redundant 30 percent of the frames, corresponding to MAXERASE=0.3, will be removed. The meaning of "most redundant" is to be taken in the sense that $A(k_i, k_{i-1})$ is the smallest for the most redundant keyframe. In this case, $k_i$ will be the redundant keyframe to be removed.

This technique, when applied to a high-motion sequence, does not produce any change, as expected, because the motion is mostly constructive, i.e., Eq. (4) is satisfied for all keyframes; thus, there is no redundancy.

Hierarchical Summary and Browsing

Although the above disclosure describes an intelligent "video indexing" system, such a system provides only a fixed sequence of video frames, which is a more compact representation of the video content than the original full sequence of video frames. In most situations, this representation is still inadequate, as the level of interest in a video sequence varies as one moves along the sequence and the content changes. Also, the level of interest in a particular video content cannot be predicted. Consider a video sequence in which a girl is petting a cat: the camera pans from the girl to the cat. One person might want to see the cat more closely but not the girl; whereas another person might want to see the girl closely but not the cat; yet another person might want to see both of them. The goal is to minimize the number of "uninteresting frames" that any one of these people watch.

In order to reconcile and satisfy diverse viewing requirements with the same video indexing system, a multi-resolutional video browser, block 53, FIG. 2, is provided to allow a user to browse the hierarchical summary by selecting a specific level summary. This is a browser instead of a mere indexing system. A viewer may start at a coarse level of detail and expand the detail with a mouse click at those

start with an equally spaced partition of the sequence of histogram vectors. For example, for a compaction ratio of 3, each partitioned set contains 3 histogram vectors (except possibly the last one or two sets). Then go through the following steps for the PLBG method:

1. Assign the centroid (or mean) histogram as the representative vector for each set of vectors.

2. Starting with the first partition, adjust each partition so as to minimize the total $l_2$ norm for the two adjacent sets on either side of the partition (hence the term pair-wise). Mathematically, if $H_{i-1}$ is the representative vector for the vectors in the set $(t_{i-1}, t_i)$ and $H_i$ is the representative vector for the vectors in the set $(t_i, t_{i+1})$, adjust $t_i$ such that the sum of the squared distances of the vectors in each set to the corresponding representative vector is minimized.

3. If [illegible], delete [illegible] from the representative set of vectors. If [illegible] the set of representative vectors.

4. Go to step 1.

The stopping criterion may be either based on the amount of decrease in distortion, or a fixed number of iterations. As previously noted, stop after 10 iterations. At each iteration the distortion ($l_2$ norm between the representative vector of each set and the corresponding vectors in the set) is reduced. Thus, the total distortion at each iteration forms a decreasing sequence. Furthermore, distortion is always greater than or equal to zero. Hence the sequence has a limit by elementary real analysis. Questions such as: "Is there a local minima (and hence a fixed point) for the iteration?" are purely academic and the reader is referred to the literature for such discussion. The deletion step (step 3) might actually result in a slightly smaller number of keyframes than were originally expected or selected.

In the above method, after stopping, the frame in the first cluster whose histogram vector is closest to the representa-
parts of the keyframe sequence which are more interesting tive vector is selected as the first keyframe. Keyframes for 
to the viewer. More than one level of detail is required so subsequent clusters may be determined in the same way. 
that the viewer may browse at a viewer-selected pace. The Better results are obtained when keyframes are selected 
finest level keyframes still may be detected. At a coarser 40 within subsequent clusters according to the "largest differ- 
level, similar keyframes at the fine levels are clustered ence from the previous keyframe criterion", where the 
together and each cluster is represented by a representative difference is expressed in terms of the action measure, 
keyframe. In the formulation of the above iteration, there is a 

To solve this clustering problem, a modification of the possibility that the last set may be inadequately represented 

well known Linde-Buzo-Gray (LBG) algorithm (or Lloyd's 45 because the last partition is always fixed to the last vector in 

algorithm or K-means algorithm) is proposed. Note that it is the sequence. The same may be said for the first frame of the 

desirable to cluster similar images together. Assume that shot, however, such a situation was not observed in the 

images are represented by their histograms and that similar experiments reported herein. Thus, another step is provided 

images have similar histograms. Treating each histogram as after the completion of iteration to resolve this problem. In 

a feature vector of its associated frame, find (N/r) represen- 50 this final stage, test whether one more representative vector 

tative histograms at the coarse level to replace the N need to be added at the end of the representation, 

histograms in the finest level, where N is the number of Specifically, consider adding the last vector as the new 

keyframes at the finest level. The parameter V is the representative. If the difference between the last vector and 

compaction ratio and is a parameter to be supplied to the the previous representative vector is less than 8X (mean of 

program by the user. In the discussion which follows, 55 the differences between all other pairs of successive repre- 

keyframes are expressed in terms of their histogram vectors. sentative vectors) the last vector is allowed. Chose 0 to be 

This is different from a regular clustering problem 0.75 during the simulations. Note that 0 may vary between 

because it is desired to pick a representative vector to 0 and 1. 

replace, for example, p consecutive vectors (in time). In the The baseline approach (Lagendijk) misses the scene that 

regular LBG case, there is no "consecutivity" restriction on 60 has a feature of interest if 3 keyframes are specified and the 

the vectors quantized to one representative vector. The baseline approach is applied to a video sequence. The results 

following iteration, which is similar to the regular LBG are inferior to that of the most compact (coarsest) level of the 

iteration, will always converge. This new 3-step iterative multilevel hierarchy with 3 keyframes, generated using the 

method is referred to herein as "pairwise" LBG, or PLBG. above method. Further, it is much more efficient to utilize the 

It must be noted that PLBG has the same local minima 65 proposed hierarchical approach than applying the baseline 

problems as LBG. Fortunately a "cleanup stage" after the algorithm multiple times to obtain different numbers of 

iterations may be used to quickly take care of this. Initially, keyframes to generate a multi-level summary. 
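The PLBG iteration described above (equally spaced initial partition, centroid update, pairwise boundary adjustment, deletion of empty sets, fixed iteration count) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `plbg` and `keyframe_index` are hypothetical, histograms are plain Python lists, the representatives are held fixed while a boundary is moved, and the cleanup stage with θ is omitted.

```python
def sq_dist(u, v):
    """Squared l2 distance between two histogram vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(vectors):
    """Mean histogram of a non-empty list of vectors (step 1)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def plbg(hists, ratio, iterations=10):
    """Cluster consecutive histogram vectors; returns (start, end) index pairs.

    bounds[i] is the first index of set i and bounds[-1] == len(hists),
    so set i covers hists[bounds[i]:bounds[i + 1]].
    """
    n = len(hists)
    bounds = list(range(0, n, ratio)) + [n]  # equally spaced initial partition
    for _ in range(iterations):
        # Step 1: centroid of each set becomes its representative.
        reps = [centroid(hists[bounds[i]:bounds[i + 1]])
                for i in range(len(bounds) - 1)]
        # Step 2: move each interior boundary to minimise the summed squared
        # distance of its two neighbouring sets (assumption: reps stay fixed).
        for i in range(1, len(bounds) - 1):
            lo, hi = bounds[i - 1], bounds[i + 1]
            best_t, best_cost = bounds[i], None
            for t in range(lo, hi + 1):
                cost = (sum(sq_dist(h, reps[i - 1]) for h in hists[lo:t])
                        + sum(sq_dist(h, reps[i]) for h in hists[t:hi]))
                if best_cost is None or cost < best_cost:
                    best_t, best_cost = t, cost
            bounds[i] = best_t
        # Step 3: coincident boundaries mean an empty set; drop it.
        bounds = sorted(set(bounds))
    return list(zip(bounds[:-1], bounds[1:]))

def keyframe_index(hists, start, end):
    """Index of the frame whose histogram is closest to the cluster centroid."""
    rep = centroid(hists[start:end])
    return min(range(start, end), key=lambda i: sq_dist(hists[i], rep))
```

With four near-identical vectors followed by two different ones and a compaction ratio of 3, the boundary moves from index 3 to index 4, so the clusters follow the content rather than the initial equal spacing.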
Block Histogram Action Measure

A histogram-based action measure is not adequate in all situations. For example, if a black object translates against a white background, the histogram-based action measure would not register the movement. In situations where it is desired to catch fine motion, for example, hand gestures or head movements, it is advantageous to have a better action measure.

Block histograms have been proposed for shot detection. However, it was concluded that block histograms were too sensitive for shot detection and give rise to a number of false alarms. The idea behind block histograms is to split the image into a few blocks (4 or 16 is usual) and define the action measure to be the sum of the absolute histogram differences over each block. It may be easily seen that block histograms would be more sensitive to motion which would not be caught by a simple overall histogram based approach. Block histograms were used experimentally for the finest level keyframes only, as shown in the 3-step iteration method of FIG. 4. The disadvantage of using block histograms is that they are more intensive in both computation and memory, as it is necessary to deal with 4 or 16 histograms per video frame instead of just one. In experimental sequences, however, it was found that the block histogram approach did not result in significant performance improvement.
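The difference between a global histogram measure and a block histogram measure can be illustrated with a short sketch. It assumes grayscale frames stored as lists of rows of integers in 0..255 and a square grid of blocks; the name `block_action` and the `grid` parameter are hypothetical. Setting `grid=1` reduces the measure to the ordinary whole-frame histogram difference.

```python
def histogram(pixels, bins=256):
    """Count occurrences of each intensity value."""
    h = [0] * bins
    for p in pixels:
        h[p] += 1
    return h

def hist_diff(a, b):
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))

def block_action(frame_a, frame_b, grid=2, bins=256):
    """Sum of absolute histogram differences over a grid x grid block layout.

    grid=2 gives 4 blocks and grid=4 gives 16 blocks.
    """
    rows, cols = len(frame_a), len(frame_a[0])
    total = 0
    for by in range(grid):
        for bx in range(grid):
            r0, r1 = by * rows // grid, (by + 1) * rows // grid
            c0, c1 = bx * cols // grid, (bx + 1) * cols // grid
            block_a = [frame_a[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            block_b = [frame_b[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            total += hist_diff(histogram(block_a, bins), histogram(block_b, bins))
    return total

# A black pixel moving across a white background: invisible to the global
# histogram, visible to the block histograms.
white = [[255] * 4 for _ in range(4)]
frame_a = [row[:] for row in white]
frame_b = [row[:] for row in white]
frame_a[0][0] = 0
frame_b[3][3] = 0
global_measure = block_action(frame_a, frame_b, grid=1)
block_measure = block_action(frame_a, frame_b, grid=2)
```

The cost noted in the text is visible here as well: the block version builds `grid * grid` histograms per frame instead of one.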

Using Motion Characteristics for Summarization 

The special cases of interest, such as pan or zoom, have not, so far, been considered. In the case of a camera pan, an intelligent browser should (a) detect the frames with a pan and (b) provide an option for the pan frames to be converted into an image mosaic for viewing purposes. Since detection of pan and zoom both involve computing motion vectors, zoom detection along with pan detection may be achieved without much additional computational overload.

Because finding the motion vectors for each frame in a sequence is computationally demanding, a pre-screening method is developed which first detects all possible sequences of frames with dominant, or global, motion. Since dominant motion may be caused by (a) pan, or (b) zoom, or (c) other special editing effect, the detected sequence is examined more closely to determine the existence of a pan or zoom.

Pre-Screening for Dominant Motion

Dominant motion implies that each pixel within the video frame experiences a change in intensity. This change in intensity is usually caused by zoom or camera motion. This change will be most noticeable in edge pixels of the video frame. The approach is to look at each pixel and determine whether it is an edge pixel, and if so to find the difference between the current pixel and the pixel at the same location in the previous frame. If the absolute value of the difference at an edge pixel is greater than a threshold (PZ_THRESH=15), the pixel is designated as having motion. To determine whether a pixel is an edge pixel, the value attained by the Sobel edge-detection operator at that pixel is compared to a threshold value (PZ_THRESH1=50). If PZ_THRESH is reduced, one might obtain false alarms. If PZ_THRESH1 is reduced, there might not be a significant change at such pixels: because they do not belong to strong edges, motion might not cause much intensity variation. In order to determine whether a particular frame is a pan frame, a threshold (PZ_THRESH2=0.9) is applied to the ratio (pan ratio) of the number of pixels which are classified as having motion to the total number of edge pixels. Another step needed to ensure that the ratio crosses PZ_THRESH2 consistently throughout the pan is to fill out the neighborhood. In other words, an edge pixel has motion if the intensity variation of any pixel within a NEIGHxNEIGH neighborhood, where NEIGH=5, is greater than PZ_THRESH. Sequences of frames which are shorter than a particular number are rejected (TOO_MANY_FRAMES_NOT_PANZOOM=5). Subsampling may be used to further reduce computational burden.
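The pre-screening test can be sketched as below. The threshold names and values are taken from the text; everything else is an assumption made for illustration: frames are lists of rows of integer intensities, the Sobel response is approximated as |gx| + |gy|, border pixels are skipped, and the function names `pan_ratio` and `has_dominant_motion` are hypothetical.

```python
PZ_THRESH = 15     # minimum intensity change for a pixel to count as moving
PZ_THRESH1 = 50    # minimum Sobel response for a pixel to count as an edge
PZ_THRESH2 = 0.9   # minimum pan ratio for a dominant-motion frame
NEIGH = 5          # side of the neighbourhood searched for intensity change

def sobel_strength(img, r, c):
    """Sobel response at (r, c), approximated as |gx| + |gy| (an assumption)."""
    gx = (img[r - 1][c + 1] + 2 * img[r][c + 1] + img[r + 1][c + 1]
          - img[r - 1][c - 1] - 2 * img[r][c - 1] - img[r + 1][c - 1])
    gy = (img[r + 1][c - 1] + 2 * img[r + 1][c] + img[r + 1][c + 1]
          - img[r - 1][c - 1] - 2 * img[r - 1][c] - img[r - 1][c + 1])
    return abs(gx) + abs(gy)

def pan_ratio(prev, cur):
    """Fraction of edge pixels in `cur` with motion relative to `prev`."""
    rows, cols = len(cur), len(cur[0])
    half = NEIGH // 2
    edge_count = moving_count = 0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            if sobel_strength(cur, r, c) <= PZ_THRESH1:
                continue  # not an edge pixel
            edge_count += 1
            # Neighbourhood fill: any changed pixel nearby marks the edge pixel.
            window = ((rr, cc)
                      for rr in range(max(0, r - half), min(rows, r + half + 1))
                      for cc in range(max(0, c - half), min(cols, c + half + 1)))
            if any(abs(cur[rr][cc] - prev[rr][cc]) > PZ_THRESH for rr, cc in window):
                moving_count += 1
    return moving_count / edge_count if edge_count else 0.0

def has_dominant_motion(prev, cur):
    """True when the pan ratio exceeds PZ_THRESH2."""
    return pan_ratio(prev, cur) > PZ_THRESH2

def step_image(edge_col, size=8):
    """Test pattern: dark left part, bright right part, vertical edge."""
    return [[0 if c < edge_col else 200 for c in range(size)] for _ in range(size)]
```

A vertical edge that shifts by one column between frames, as in `has_dominant_motion(step_image(4), step_image(5))`, marks every edge pixel as moving, whereas two identical frames give a ratio of zero. The rejection of short sequences (TOO_MANY_FRAMES_NOT_PANZOOM) is not shown.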
Pan Detection 

The approach for pan detection is a variation of known techniques. In order to detect a pan, look at the motion vectors at subsampled pixel locations (SPACING=24). The method used to determine the motion vectors is simple block matching (BLKSIZE=7x7, SEARCHSIZE=24x24). The search size is varied based upon the pan vector of the previous frame: the search size is halved if the previous pan vector is smaller than (SEARCHSIZE/2)-2; the original (larger) search size is restored when the previous pan vector is greater than (SEARCHSIZE/2)-2. This variation results in no performance degradation.
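The block matching and the adaptive search size can be sketched as follows. Several points are assumptions rather than statements from the text: SEARCHSIZE is interpreted as the maximum displacement in each direction, the matching cost is the sum of absolute differences, the pan vector size is taken as its largest component, and the case where the previous pan vector equals (SEARCHSIZE/2)-2 exactly (which the text leaves open) keeps the full search size. The function names are hypothetical.

```python
BLKSIZE = 7
SEARCHSIZE = 24

def next_search_size(prev_pan_vector):
    """Halve the search size after a small pan vector, else use the full size."""
    magnitude = max(abs(prev_pan_vector[0]), abs(prev_pan_vector[1]))
    limit = SEARCHSIZE // 2 - 2
    # Equality is not specified in the text; the full size is used here.
    return SEARCHSIZE // 2 if magnitude < limit else SEARCHSIZE

def block_match(prev, cur, top, left, search):
    """Motion vector (dx, dy) for the BLKSIZE x BLKSIZE block of `cur` at (top, left).

    The vector points from the block in `cur` to its best match in `prev`.
    Candidates that fall outside the frame are skipped.
    """
    rows, cols = len(cur), len(cur[0])
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r0, c0 = top + dy, left + dx
            if r0 < 0 or c0 < 0 or r0 + BLKSIZE > rows or c0 + BLKSIZE > cols:
                continue
            sad = sum(abs(cur[top + r][left + c] - prev[r0 + r][c0 + c])
                      for r in range(BLKSIZE) for c in range(BLKSIZE))
            if best_sad is None or sad < best_sad:
                best, best_sad = (dx, dy), sad
    return best

# Synthetic textured frame and a copy shifted two pixels to the right.
prev_frame = [[7 * r * r + 3 * c * c + r * c for c in range(20)] for r in range(20)]
cur_frame = [[prev_frame[r][c - 2] if c >= 2 else 0 for c in range(20)]
             for r in range(20)]
vector = block_match(prev_frame, cur_frame, 6, 6, 4)
```

Halving the search range reduces the number of candidate positions per block by roughly a factor of four, which is where the saving comes from when successive pan vectors are small.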

For pan detection, it has been proposed to find all motion vectors parallel to the modal (most frequently occurring) motion vector within a tolerance limit. If the number of such motion vectors is greater than a particular threshold, a pan is detected. However, in the case of a pan, not only are the motion vectors parallel, they also have approximately the same magnitude. Therefore, a small neighborhood of the modal motion vector is examined, instead of looking at all parallel motion vectors. If a tie in the value of a modal motion vector occurs, an arbitrary decision is made. The size of the neighborhood is controlled by VARN (=4). Larger values for VARN would lead to a smaller neighborhood around the modal motion vector (VARN=4 in our case implies a 3x3 neighborhood). PANRATIO (=0.5) determines the threshold on the ratio between the number of motion vectors within the neighborhood and the total number of motion vectors. Even if some frames in a sequence of pan frames fall below the thresholds, continuity of the pan is ensured if the hole is not bigger than 3 (TOO_BIG_A_HOLE=3).
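The modal-neighborhood test and the hole filling can be sketched as below. PANRATIO and TOO_BIG_A_HOLE are taken from the text. The mapping from VARN to a neighborhood size is not given, so the sketch uses a hypothetical `radius` parameter, with `radius=1` standing for the 3x3 neighborhood. The function names are likewise hypothetical, and motion vectors are assumed to be integer (dx, dy) tuples.

```python
from collections import Counter

PANRATIO = 0.5
TOO_BIG_A_HOLE = 3

def is_pan_frame(vectors, radius=1):
    """True when enough motion vectors lie near the modal motion vector."""
    if not vectors:
        return False
    # Ties for the mode are broken arbitrarily (first vector encountered).
    (mx, my), _count = Counter(vectors).most_common(1)[0]
    near = sum(1 for dx, dy in vectors
               if abs(dx - mx) <= radius and abs(dy - my) <= radius)
    return near / len(vectors) > PANRATIO

def fill_holes(flags, max_hole=TOO_BIG_A_HOLE):
    """Bridge interior runs of non-pan frames no longer than max_hole."""
    out = list(flags)
    n = len(out)
    i = 0
    while i < n:
        if out[i]:
            i += 1
            continue
        j = i
        while j < n and not out[j]:
            j += 1
        # Only gaps bounded by pan frames on both sides are bridged.
        if i > 0 and j < n and (j - i) <= max_hole:
            for k in range(i, j):
                out[k] = True
        i = j
    return out
```

For example, a frame whose vectors are mostly (4, 0) or (5, 0) passes the test, while a frame with scattered vectors does not; a gap of two non-pan frames inside a pan is bridged, but a gap of four is left open.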
Zoom Detection 

Examining the outermost rim of motion vectors in an image, i.e., motion vectors at the edges of the image, should detect zoom conditions. Motion vectors at diametrically opposite positions of the rim should point in opposite directions. A threshold (ZOOMRATIO=0.7) is applied to the ratio of motion vectors pointing in opposite directions to the total number of motion vectors. Only the motion vectors on the outer rim are used because the center of zoom might be located anywhere within the image. Thus motion vectors at the outer rim are the best indicators of the presence of a zoom. Additionally, there is not as much foreground motion at image edges.
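A minimal sketch of the rim test follows. It assumes the motion vectors are available as a rectangular grid of (dx, dy) tuples, that "diametrically opposite" means point reflection through the grid center, and that two vectors point in opposite directions when their dot product is negative. The ratio is taken over rim vectors only. The function name `zoom_ratio` is hypothetical; ZOOMRATIO is the threshold named in the text.

```python
ZOOMRATIO = 0.7

def zoom_ratio(field):
    """Fraction of rim vectors opposed to their diametrically opposite vector."""
    rows, cols = len(field), len(field[0])
    rim = [(r, c) for r in range(rows) for c in range(cols)
           if r in (0, rows - 1) or c in (0, cols - 1)]
    opposed = 0
    for r, c in rim:
        ax, ay = field[r][c]
        bx, by = field[rows - 1 - r][cols - 1 - c]  # reflected position
        if ax * bx + ay * by < 0:  # negative dot product: opposite directions
            opposed += 1
    return opposed / len(rim)

def is_zoom_frame(field):
    return zoom_ratio(field) > ZOOMRATIO

# Vectors radiating from the centre (zoom) versus a uniform field (pan).
zoom_field = [[(c - 1, r - 1) for c in range(3)] for r in range(3)]
pan_field = [[(3, 0) for _ in range(3)] for _ in range(3)]
```

A pan gives parallel vectors at opposite rim positions, so the ratio stays near zero, which is what separates the two cases.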
Color Processing 

In this portion of the disclosure, the previously disclosed methods are extended to color sequences. Two different embodiments are described. In the first embodiment, a concatenated histogram consisting of a 256-bin Y-histogram and two 128-bin U and V histograms is used. In the second embodiment, a simple 256-bin Y-histogram is used. For some experimental sequences, no significant change in results was observed between the two. In both cases the activity measure is defined as in Eq. 1. However, in some sequences using a color histogram may be crucial for detecting change between two video frames, e.g., when the luminance stays nearly the same but chroma values change.
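The concatenated histogram of the first embodiment can be sketched as follows. The sketch assumes 8-bit Y, U and V samples given as flat lists, and maps each chroma sample to one of 128 bins by integer division by 2, which is an assumption about the binning. Eq. 1 is not reproduced in this section, so a plain sum of absolute bin differences is used as a stand-in for the activity measure.

```python
def concat_histogram(y, u, v):
    """512-bin vector: 256 Y bins, then 128 U bins, then 128 V bins."""
    hist = [0] * 512
    for p in y:
        hist[p] += 1
    for p in u:
        hist[256 + p // 2] += 1  # two adjacent chroma values share a bin
    for p in v:
        hist[384 + p // 2] += 1
    return hist

def l1_difference(h1, h2):
    """Stand-in activity measure: sum of absolute bin differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

# Two frames with identical luminance but different U chroma.
y = [100] * 4
hist_a = concat_histogram(y, [10] * 4, [20] * 4)
hist_b = concat_histogram(y, [200] * 4, [20] * 4)
luma_only = l1_difference(hist_a[:256], hist_b[:256])
with_chroma = l1_difference(hist_a, hist_b)
```

Here the luminance-only difference is zero while the concatenated histogram registers the change, which is the situation described in the last sentence of the paragraph above.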
Summary of the Uncompressed Video Input Method 

A block diagram of the hierarchical summary and brows- 
ing method is shown in FIG. 2. The dissolve, fade in/fade 
out, removal module is explained in the cited related 
application, and included herein by reference, and discloses a dissolve detection method. The module is used to convert a dissolve into an abrupt scene transition by removing the transition frames from the video sequence. The finest level keyframe detection block is expanded in FIG. 4, where major steps are shown. The automatic pan/zoom auto processing module is presented in detail in FIG. 3. It automatically (a) detects and builds a mosaic (panoramic) image if there is a pan and (b) detects and finds the first and the last frames of a zoom sequence. It also excludes the pan/zoom related keyframes from the finest level keyframes, so that only non-pan and non-zoom frames participate in the hierarchical keyframe generation process. This removal and auto pan/zoom processing is optional and may be enabled interactively at only certain portions of the video clip by the user if desired. The GUI will allow the user to start browsing the video at a particular level of summary (among the various levels generated by the hierarchical browser). For instance, the coarsest summary along with mosaic images and zoom summary may be presented first. Then the user may interactively browse at finer hierarchy levels. With the click of a button the user may access either the parent or children of the keyframe currently being viewed. Choosing the parent will result in the replacement of a group of keyframes at the current level by its parent. Choosing the children will find all the child keyframes corresponding to the current keyframe. FIG. 5 illustrates this concept of parent and child keyframes. At the finest level further expansion, i.e., the children at the finest level, will lead to the playing of the video clip between specified keyframes. At the time the video is played, sound corresponding to that part of the video clip may also be synchronously played. This functionality of playing the video clip may also be provided at coarser levels of the hierarchy.

The video browsing method described herein may have applications which go beyond simply providing an effective user interface for multi-media manipulation. It provides an understanding of the temporal nature of the video sequence which may be potentially employed in second generation video coding systems, reminiscent of second generation image coding systems. For example, encoders designed to deal with an MPEG-2 bitstream blindly adapt an IBBP or IBBBP format. However, a hierarchy of keyframes may be used in designing encoders which intelligently, and more importantly, computationally efficiently, adapt to the nature of the temporal video stream, thus providing higher quality while utilizing lesser resources. Information on how to utilize a hierarchy of video frames in improving compression is available in the literature, where the multi-scale nature of a segmentation algorithm is exploited to obtain lossless still image compression. A major difference between second generation image coding systems and second generation video coding systems is that the former necessitated a fundamental change in the coding mechanism, and hence failed to make much impact, while the latter may be incorporated within any of the existing video coding standards.

Computational Performance

The computational performance of the keyframe generation method depends heavily upon the hard disk access speed of the computer used to practice the method of the invention. In the following discussion, "real time processing" means the ability to process 30 frames per second at a given resolution. For a 300 frame quarter common intermediate format (QCIF) color sequence (176x144 resolution), it was found that construction of the histograms took 11 seconds, while the rest of the processing took less than a second on a SUN® Ultra SPARC-2®. Thus, provided that histogram computation may be achieved in real time, it should be easy to achieve real time hierarchical keyframe generation. It may also be noted that the processing after the computation of the histograms is independent of the actual frame resolution, thus the amount of time taken to process a 300 frame QCIF sequence is the same as that of processing a sequence at 1024x780 resolution, provided that the histograms of each frame have been pre-computed.

Currently global motion detection may be carried out in real time. However, due to the heavy computational burden associated with the block matching algorithm, which is required for pan/zoom detection, pan/zoom processing may not be carried out in real time in a software implementation.

COMPRESSED VIDEO INPUT

In the foregoing discussion, only uncompressed or decompressed bitstreams were considered and used in the experiments. However, most of the available video streams are in a compressed format for compact storage. The method of FIG. 4 may be extended to a compressed bitstream in such a way as to extract keyframes while performing minimal decoding. It will be appreciated that a brute-force method of dealing with compressed video may be simply to decompress the entire video stream, thereafter using the techniques described herein for uncompressed video.

This portion of the disclosure deals with a variation of hierarchical summarization and browsing of digital video as may be used with MPEG-2 bitstreams. The overall scheme is summarized in the flow diagram given in FIG. 7. A novel way of computing histograms is disclosed. Histograms of DC coefficient of 8x8 blocks are used. The process begins with an input 132. Histogram computation 134 for I pictures is therefore straightforward by methods well known to those of ordinary skill in the art. Histogram computation for P and B pictures [...] reference frames, is performed as disclosed herein, resulting in increased accuracy in histograms [...] keyframe selection. Hierarchical keyframe selection 136 determines the identities of the keyframes of the hierarchical summary, for instance in terms of their temporal display order, and provides this information to a decoder manager that will be described later herein. Once the histograms of DCT coefficients are generated, hierarchical keyframe selection is performed as taught in connection with FIG. 4.

It should be noted that a mechanism for detecting dissolve regions in the video, such as the one disclosed in my co-pending application: "Detecting Dissolve Regions in Video Sequences," cited above, may be easily integrated to processing block 134 in FIG. 7 that performs histogram computation and BIT generation. Namely, frames contained in a dissolve region may be marked within BIT and ignored in the subsequent keyframe selection process. Otherwise, frames within the dissolve region may give rise to spurious keyframes.

The method generates a record of the bitstream, concurrent to histogram computation 134, that contains information about each picture, such as their byte offset location in the bitstream, their reference frames, and the quantization matrix used in quantizing the DCT blocks. In the current invention, a table referred to as the "bitstream index table" (BIT) is generated. The contents of BIT 138 and the method of generating BIT are discussed in detail later herein.

One purpose of BIT 138 is to capture the essential parameters of the bitstream in order to enable decoding of
the keyframes for generating a visual summary without the need for decoding or parsing the entire bitstream. Parsing requires that the system look at every bit in the video stream, regardless of whether the video stream is decoded or not. In addition, the BIT, or a slimmed down version of BIT, is provided along with the original bitstream and the identity of the summary, as depicted in FIG. 8, for efficient browsing by the user when the user, for instance, wants to visually display the summary or playback the video between two keyframes. Later herein, a specific embodiment of the method is described wherein a summary is presented to the user and some interactivity is provided. Note that in FIG. 8, the bitstream may reside in memory located at a different physical location than the BIT and the identity of summary frames. For instance, bitstreams may reside in a database server and the summary and the BIT may reside at the local machine. Alternatively, all three types of data may reside in the same medium such as a DVD disk or any other high capacity storage medium. Methods for further compaction (or pruning) of BIT are discussed in the section entitled "Generating a pruned bitstream index table for compact storage." It should be appreciated that, having generated BIT, and having decided to "prune" the size of BIT, any number of techniques may be used to down-size BIT. A single example is provided herein. It should also be noted that it is possible not to form and store a BIT at all, but to parse the entire bitstream and decode every time a keyframe needs to be decoded.

Referring again to FIG. 7, during generation of hierarchical summary 140, the information contained in BIT is utilized by decoder manager 142 to selectively decode the keyframes, passed to an MPEG-2 decoder 144, and, once decoded, forms hierarchical summary 140. Decoder manager 142 performs a similar task during the presentation stage, as the user desires to browse through the video by playing back video between the keyframes. The working principles of the decoder manager (that may be implemented by a computer program, for instance) are discussed below.

The invention may be implemented within a video camera that is storing MPEG-2 compressed video, subsequent to recording. In such a case, the summary information and BIT may be stored in a storage system that also stores the video stream, or they are stored in any memory location that is linked with the video stream in a well-defined fashion. The hierarchical summary itself, containing the keyframes, or their subsampled versions, may also be stored in a storage system for immediate access. On-camera user interface may be provided for identification of video content stored in the camera, on tape, or on any other storage medium on the basis of the hierarchical summary.

Alternatively, bitstreams may be downloaded from a camera to a computer where the summarization process is carried out. In this case, the summary may be copied back to the tape or any other storage medium holding the video, or on any other memory with a well-defined link to the video bitstream. For instance, cameras that directly record compressed MPEG streams are currently available (e.g., Hitachi MP-EG1A camera) where bitstreams may then be downloaded to a PC. The system of the current invention may be used to process such bitstreams on a PC platform.

The following issues must be addressed and resolved in order to make the hierarchical video summary work efficiently with MPEG-2 bitstreams:

1. Generate a keyframe hierarchy while performing minimal decoding of the MPEG-2 bitstream.

2. Establish a procedure for decoding the selected keyframes from the MPEG-2 bitstreams without having to decode all the frames.

3. Develop a strategy to decode a stretch of frames between two given keyframes.

This approach works at the histogram level. A method is disclosed that computes a color histogram for each frame while minimally decoding the MPEG-2 bitstream.

Histogram computation and consequently subsequent processing is insensitive to subsampling by a factor of 8 in each dimension. Going one step further, it was found that histograms computed using only the DC component of the DCT of 8x8 blocks, i.e., the mean of 8x8 blocks, were sufficient for practical purposes. For motion compensated images, it has been proposed that approximate motion compensation could be used to reduce the computation while obtaining negligible degradation in performance. According to the block matching scheme used in the MPEG standard, a 16x16 macro block motion vector may overlap, at most, four other 16x16 macro blocks in the reference frame from which motion compensated prediction is being performed. Similarly each 8x8 sub-block within the 16x16 macro block overlaps, at most, four other 8x8 sub-blocks. Thus, it was suggested that each 8x8 sub-block may be approximated by a weighted average of the values in each of the 8x8 sub-blocks that it overlaps. The weights assigned to individual blocks could be made proportional to the area of the overlap. Referring to FIG. 9, the 8x8 sub-block's mean value is:

$$\frac{(a)(b)(m_1) + (8-a)(b)(m_2) + (a)(8-b)(m_3) + (8-a)(8-b)(m_4) + \Delta_{DCT}}{64} \qquad (5)$$

where Δ_DCT is 8 times the DC component of the residual DCT for the block (the factor of 8 comes in because the DC component of the residual DCT for the block used in the MPEG-2 standard is one-eighth of the mean value of the residual error of the block). Histograms are obtained by updating the histogram vector with the mean of each 8x8 block within the image found as in Eq. 5. The above method of obtaining histograms has certain problems leading to possible degradation of performance. One of the improvements of the method of the invention is to propose a better way of handling histogram computation for MPEG bitstreams.

MPEG bitstreams incorporate complicated coding strategies which necessitate decoding information from other parts of the bitstream before one may attempt to decode a particular frame. A successful video browsing strategy also needs to address the problem of decoding particular video frames in the minimum amount of time.

Computing Histograms from MPEG Bitstreams

Decoding an MPEG bitstream involves two computationally intensive steps:

1. Obtaining inverse DCT of 8x8 blocks.

2. Motion compensation with 16x16 macro blocks (in the case of MPEG-2 bitstreams, the blocks may be smaller or have [...] even/odd fields).

Previously, it was pointed out that replacing an 8x8 block by its mean value does not have much effect on the histogram of the image. In the implementation, each 8x8 block is replaced by 8x(DC value of the DCT coefficients). From the formula for inverse DCT computation it may be seen that this yields the mean value of the block, accurate within compression related quantization error.

In order to understand the next step, a brief review of the coding strategy employed in an MPEG bitstream is provided. A typical MPEG bitstream has three kinds of frames:
I (intra-coded frame),

B (bi-directionally predicted frame), and

P (predicted frame).

An I frame contains only DCT data (no motion compensation is performed). Thus using the DC value of DCT coefficients to compute a histogram completely covers the problem of minimally decoding I frames. B and P frames involve the additional step of using block motion vectors to predict the current frame from previously decoded reference frame(s). Note that the previous decoded frame available has itself only been partially decoded. Thus, the strategy to be used in decoding the B and P frames must be carefully considered. In the following discussion, Case (a) refers to a motion compensation scheme which already exists and is commonly used in literature. Case (b) refers to a new motion compensation scheme that is disclosed herein.

In order to simplify motion compensation, most known methods use the scheme given in the previous section, where each 8x8 sub-block is replaced by the weighted average of the 8x8 sub-blocks it overlaps. Consider the two scenarios: Case (a), replace the 8x8 sub-block with the weighted average of overlapped blocks in the partially decoded reference frame, and Case (b), replace the 8x8 sub-block with the exact pixels from the partially decoded reference frame. In Case (a), it will be seen that the entire 8x8 block in the motion compensated predicted frame will have a single value. In Case (b), the 8x8 block may potentially have many different values (i.e., pixels within it may have many different values). In order to illustrate this further, consider an example of an 8x8 block going through Cases (a) and (b). FIG. 10 illustrates this. In FIG. 10, assume that the prediction block is obtained from an I frame, i.e., each 8x8 block has a single value associated with it in the prediction frame. Case (a) will lead to an 8x8 block in the current frame which has only one value μ. Case (b) will lead to an 8x8 block in the current frame which has potentially four different values.

This does not cause much difference in the first few motion compensated frames (P or B frames) following an intra-coded reference frame (or I frame). In fact, because of the insensitivity of the histogram computation to averaging and sub-sampling, it would seem that the two procedures will be equally effective for histogram computation. However, Case (a) should be favored because it involves less computation and memory consumption. This occurs because in any given frame (I or P or B), with motion compensation performed as in Case (a), only one value for each 8x8 block will be obtained. Thus, 8 times less capacity is needed in each dimension, i.e., potentially 64 (8x8) times lesser memory than for storing entire frames. However, Case (a) [...] sufficiently long sequence of motion compensated frames, and one would expect Case (a) to yield a single value for the entire frame, provided that there is sufficient motion between frames, as described later herein. This, however, does not occur in Case (b).

In order to explain this phenomenon more thoroughly, consider replacing each 8x8 block in the I frame by its mean to produce a smaller version of the original image. Now, motion compensation as implemented in Case (a) necessitates application of a 2x2 averaging filter repeatedly on this small image. From elementary Fourier analysis, it may be shown that repeated application of an averaging filter would lead to a uniform image in the limit, neglecting edge effects.

It was observed in practice that the above observations hold true. For a typical MPEG-2 compressed sequence the distance between two I frames is 15. It was found that this led to a very noticeable degradation of performance when motion compensation was performed according to Case (a). The motion compensation scheme of Case (a) produces a strong periodical variation in the histograms which leads to spurious keyframe detection. Thus, Case (b) was used for implementation. Computing the histograms using the minimal decoding method cuts the histogram computation time by half for a QCIF sequence, although the advantage was found to be larger for a higher resolution. Currently, a 512-dimensional histogram vector is used, and is formed by concatenating a 256 bin grey scale (Y component) histogram, a 128 bin U component histogram and a 128 bin V component histogram. Note that the above discussion applies to the Y, U and V components of a frame individually, irrespective of the chroma format.

Extracting Particular Frames from an MPEG-2 Bitstream

Extracting particular frames from an MPEG-2 bitstream, in the embodiment described herein, is a two step procedure. In the first step, which is carried out concurrently with the histogram calculation, a "bitstream index table" is generated which contains the information necessary to quickly decode a randomly picked frame from the MPEG-2 bitstream. Once the keyframe hierarchy is generated, i.e., identities of keyframes that will be in the hierarchical summary are defined, only the keyframes at the finest level of hierarchy need be decoded; frames at a coarse level of the hierarchy are a subset of the frames at the finest level. The second step in the keyframe extraction procedure is carried out by the decoder manager, as explained below, which uses the bitstream index table generated in the first step.

The advantage of the above two step procedure over decoding relevant portions of the bitstream directly is a saving in time that would be required to review the entire bitstream to the frame of interest. In order to decode frame number 1350 from a 1400 frame MPEG-2 bitstream without

might lead to excessive degradation, as explained below, and 50 , . . " * . . ** 

hence is not a viable alternative a bltstream index table > 11 15 necessary to parse the entire 

a 4U , . _ ^ P ' #; bitstream up to frame number 1350, although it might not be 

As the number of contiguous motion compensated „ , /, . -j li , T r 

frames, i.e., without an intervening I frame, increases, the fecoded^Tms takes a considerable amount of time. If 

difference between Case (a) and Case (b) increases. Refer- "biUtream index table is available, one rnay go directly to 

ring back to FIG. 3, consider what happens when prediction SS * e «? ev « nt P°* lons ° f . me b^tream; thus parsing and 

is attempted from an already motion compensated frame, for ^^p* a ! Ko1 .^ mlmnlum "T? °^ 7% J 

, f n c * T»r C r> e ♦ I ne following lnlormation is needed in order to decode a 

example, from a P frame to a B frame or from a P frame to . , • i . * c j . L . • 

.if n c t/-« /\.u -u*j • randomly picked frame, reterred to herein as me current 

another P frame. In Case (a), the weighted averaging opera- „ / * «... 

tion is applied on the four blocks the prediction block frame ' from aD MPEG " 2 bltstream: 

overlaps, each of which has a single value, and finish with 60 ™ e most recent Sequence Header in the past (its byte 

a single value for the entire 8x8 predicted block. In Case (b), offset). 

because each block in the prediction frame may have 2 - bvte offiset of ^ current frame into the bitstream. 

potentially four (or more) different values, the current pre- 3. The most recent Quantization Matrix reset (if any) in 

dieted block may have a large number of different values. the past (its byte offset). 

Now, one should note the key difference which emerges 65 4. The reference frames (1-P/I-I/P-P) corresponding to the 

between Cases (a) and (b) as this chain of prediction from current frame, if the current frame is a B frame (their 

already motion compensated frames becomes large. Given a byte offsets). 
5. The most recent I frame (which is the P frame's
reference frame) if the current frame is a P frame (its 
byte offset). 

It is to be noted that in the case of B/P frames, several
frames other than the reference frame(s) may need to be
decoded in order to correctly decode the reference frame(s).
A common data structure to hold the above information has
been developed to facilitate information exchange between
the two steps, i.e., (1) generating the bitstream index table
and (2) using the bitstream index table in the frame
extraction method performed by the decoder manager,
described later herein. The following segment of C-code
embodies the different flags used in formulating the
bitstream index table (BIT), although it will be appreciated
that this is merely an example, and that the BIT may have any
number of different syntax forms:

enum IndexFileState {
    K_SEQUENCE_HEADER = 0,
    K_PICTURE_IFRAME,           /* 1 */
    K_PICTURE_BFRAME,           /* 2 */
    K_PICTURE_PFRAME,           /* 3 */
    K_QUANT_MATRIX_EXTENSION,   /* 4 */
    K_END_OF_DATA,              /* 5 */
    K_OFFSET                    /* 6 */
};

The K_OFFSET flag is added to any byte offset to
differentiate it from the other flags defined above. Thus a
byte offset of 15 would translate to 15+K_OFFSET (=21) in
terms of our representation. K_END_OF_DATA is used as a
delimiter between different events (for example, sequence
header and I frame, or I frame and B frame, etc.). To
understand how the generated bitstream index table appears,
suppose that the following sequence of events needs to be
coded:

1. Sequence Header starts at 0 bytes. 

2. I picture at 150 bytes 

3. P picture at 3000 bytes 

4. B picture at 4200 bytes 

5. B picture at 5300 bytes 

6. Quant Matrix reset at 5400 bytes 

7. P picture at 6200 bytes 

This sequence is converted into the following represen- 
tation: 

K_END_OF_DATA K_SEQUENCE_HEADER K_OFFSET+0 K_END_OF_DATA
K_PICTURE_IFRAME K_OFFSET+150 K_END_OF_DATA
K_PICTURE_PFRAME K_OFFSET+3000 K_END_OF_DATA
K_PICTURE_BFRAME K_OFFSET+4200 K_END_OF_DATA
K_PICTURE_BFRAME K_OFFSET+5300
K_QUANT_MATRIX_EXTENSION K_OFFSET+5400 K_END_OF_DATA
K_PICTURE_PFRAME K_OFFSET+6200 K_END_OF_DATA

This in turn will yield a byte representation, using the
C-data structure given above (in which K_PICTURE_BFRAME is 2
and K_PICTURE_PFRAME is 3), of:

5 0 6 5 1 156 5 3 3006 5 2 4206 5 2 5306 4 5406 5 3 6206 5
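
As a minimal sketch of how the seven events listed above could be turned into such an integer sequence, consider the following C fragment. The `Event` record, its `ends_unit` field and the function `bit_encode` are hypothetical names introduced here for illustration only; they are not part of the disclosed implementation. The delimiter placement simply mirrors the symbolic representation given above, and the picture-type codes follow the enum order as declared (B = 2, P = 3).

```c
#include <assert.h>
#include <stddef.h>

/* Flags as declared in the text; values follow declaration order. */
enum IndexFileState {
    K_SEQUENCE_HEADER = 0,
    K_PICTURE_IFRAME,          /* 1 */
    K_PICTURE_BFRAME,          /* 2 */
    K_PICTURE_PFRAME,          /* 3 */
    K_QUANT_MATRIX_EXTENSION,  /* 4 */
    K_END_OF_DATA,             /* 5 */
    K_OFFSET                   /* 6: bias added to every byte offset */
};

/* Hypothetical event record (an assumption, not from the text): a flag,
   the byte offset of the event, and whether a delimiter follows it. */
typedef struct {
    enum IndexFileState flag;
    long byte_offset;
    int ends_unit;
} Event;

/* Writes the table for n events into out (capacity cap).
   Returns the number of values written, or 0 if out is too small. */
size_t bit_encode(const Event *ev, size_t n, long *out, size_t cap)
{
    size_t w = 0;
    assert(ev != NULL && out != NULL);
    if (cap < 1 + 3 * n)        /* leading delimiter + up to 3 values each */
        return 0;
    out[w++] = K_END_OF_DATA;   /* leading delimiter */
    for (size_t i = 0; i < n; i++) {
        out[w++] = ev[i].flag;
        out[w++] = ev[i].byte_offset + K_OFFSET;
        if (ev[i].ends_unit)
            out[w++] = K_END_OF_DATA;
    }
    return w;
}
```

Called with the seven events of the example (the second B picture having `ends_unit` set to 0, since no delimiter separates it from the quantization matrix reset), the function produces 21 values.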

The spaces in the above byte-wise representation are
necessary for the decoder to parse the bitstream. Note that
the K_END_OF_DATA flag is, strictly speaking, redundant.
However, this flag may be used to prune out any
spuriously generated data (due to errors in the bitstreams),
thus making the algorithm error resilient. The flag acts as a
"sync signal" to remove spurious data; for example, a
sequence header not followed by a byte offset (due to an
error in the bitstream) will be discarded.
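
The "sync signal" role of K_END_OF_DATA may be illustrated by a small reader for the table. This is only a sketch under stated assumptions: the `IndexEntry` record and the function `bit_parse` are hypothetical, and the recovery policy (skip forward to the next delimiter whenever a flag is not followed by an offset value) is one plausible reading of the behaviour described above, not a statement of the actual implementation.

```c
#include <assert.h>
#include <stddef.h>

enum IndexFileState {
    K_SEQUENCE_HEADER = 0,
    K_PICTURE_IFRAME,
    K_PICTURE_BFRAME,
    K_PICTURE_PFRAME,
    K_QUANT_MATRIX_EXTENSION,
    K_END_OF_DATA,
    K_OFFSET
};

/* Hypothetical record for one recovered (flag, byte offset) pair. */
typedef struct {
    int flag;
    long byte_offset;
} IndexEntry;

/* Scans the table and keeps only well-formed (flag, offset) pairs.
   A flag not followed by an offset value (>= K_OFFSET), or a stray
   offset with no flag, is treated as spurious: the reader skips ahead
   to the next K_END_OF_DATA. Returns the number of pairs found and
   stores at most cap of them in out. */
size_t bit_parse(const long *tab, size_t n, IndexEntry *out, size_t cap)
{
    size_t count = 0, i = 0;
    assert(tab != NULL && out != NULL);
    while (i < n) {
        if (tab[i] == K_END_OF_DATA) {
            i++;                                /* delimiter only */
        } else if (tab[i] >= 0 && tab[i] < K_END_OF_DATA &&
                   i + 1 < n && tab[i + 1] >= K_OFFSET) {
            if (count < cap) {
                out[count].flag = (int)tab[i];
                out[count].byte_offset = tab[i + 1] - K_OFFSET;
            }
            count++;
            i += 2;
        } else {
            /* Spurious data: resynchronize on the next delimiter. */
            while (i < n && tab[i] != K_END_OF_DATA)
                i++;
        }
    }
    return count;
}
```

For instance, in the table `5 0 5 1 156 5` the sequence header flag (0) is immediately followed by a delimiter rather than an offset, so it is dropped and only the I frame at byte 150 is recovered.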

The decoder manager, which uses the "bitstream index table"
generated above, functions as follows:

1. Initialize the last decoded frame number (ldf) to -1.

2. For each frame to be decoded (ftd):

Find the frame at which to start decoding (sdf).
    If ftd is an I frame, then sdf=ftd.
    If ftd is a P frame, then sdf=most recent I frame
    before ftd.
    If ftd is a B frame, then sdf=most recent I frame
    before both of the anchor frames corresponding to
    ftd.
    If sdf obtained above is less than ldf+1, set sdf=ldf+1.
    Thus if sdf<ldf+1, some of the required frames
    have already been decoded.
For i=ldf+1 to sdf:
    Find the most recent sequence header (rsh).
    Find the most recent quant matrix reset (qmr), if any,
    provided qmr is greater than rsh.
    Decode rsh, qmr in the order they appear in the
    bitstream.
Decode all I and P frames sequentially starting from sdf
till ftd-1.
Decode ftd.
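
The selection of sdf and of the frames that are then decoded may be sketched in C as follows. Several assumptions are made that are not stated in the text: frames are indexed in bitstream (coded) order, so that both anchor frames of a B frame precede it; frame types are given as a string of the characters 'I', 'P' and 'B'; "most recent I frame before the anchors" is read as at or before the earlier anchor; and frames are requested in increasing order. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the most recent I frame at or before pos, or -1 if none. */
static int recent_i_frame(const char *types, int pos)
{
    while (pos >= 0 && types[pos] != 'I')
        pos--;
    return pos;
}

/* Returns the start decode frame (sdf) for the frame to be decoded (ftd),
   given the last decoded frame number (ldf, -1 if nothing is decoded). */
int start_decode_frame(const char *types, int ftd, int ldf)
{
    int sdf;
    assert(types != NULL && ftd >= 0);
    if (types[ftd] == 'I') {
        sdf = ftd;
    } else if (types[ftd] == 'P') {
        sdf = recent_i_frame(types, ftd - 1);
    } else {
        /* B frame: walk back over its two anchor (I or P) frames,
           then on to the nearest I frame. */
        int pos = ftd - 1, anchors = 0;
        while (pos >= 0) {
            if (types[pos] != 'B') {
                anchors++;
                if (anchors == 2)
                    break;
            }
            pos--;
        }
        sdf = recent_i_frame(types, pos);
    }
    if (sdf < ldf + 1)          /* earlier frames are already decoded */
        sdf = ldf + 1;
    return sdf;
}

/* Lists the frames decoded to reach ftd: every I and P frame from sdf
   up to ftd-1, then ftd itself. Returns the count; stores at most cap. */
size_t frames_to_decode(const char *types, int sdf, int ftd,
                        int *out, size_t cap)
{
    size_t n = 0;
    assert(types != NULL && out != NULL && sdf >= 0);
    for (int f = sdf; f <= ftd; f++) {
        if (f == ftd || types[f] != 'B') {
            if (n < cap)
                out[n] = f;
            n++;
        }
    }
    return n;
}
```

With the coded-order type string "IPBBPBBIBB", requesting frame 5 (a B frame) with nothing yet decoded gives sdf=0, and frames 0, 1, 4 and 5 are decoded; requesting the same frame when ldf=4 gives sdf=5.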

In order to decode a stretch of frames, decode the first 
frame (ftd) following the decoder manager procedure, 
above. The rest of the frames are sequentially decoded till 
the end of the stretch. 

Field pictures need to be taken care of as a special case, 
if needed. One may possibly use the histograms of the 
even/odd fields, whichever is decoded first. The other field
may not be decoded, in the case of B pictures, or may be 
decoded with the minimal decoding strategy, in the case of 
P and I pictures. The histograms need to be scaled by a factor 
of 2 if only one field is being decoded. It may also be 
possible that the extra decoded field may not be used in the 
computation of the histograms for P/I frames; in this case the 
histograms need not be scaled, because all frames have only 
one field contributing to the histogram. In order to differ- 
entiate fields from frames and take appropriate steps, the 
MPEG-2 bitstream provides two pieces of information from 
the picture header and picture coding extension: 

1. The temporal reference (in the picture header) provides the
frame number being currently decoded. Note that the 
temporal reference is reset at the start of every Group 
of Pictures header. 

2. The picture structure (in the picture coding extension)
provides the top/bottom field information. 

Generating a Pruned Bitstream Index Table for Compact 
Storage 

An important issue from an implementation point of
view is the compact representation of the BIT to save disk
space. At first glance this might not seem important, since
the bitstream index table may take only about 8-10 bytes of
space for each frame, compared with the large space
occupied by the MPEG video. Nevertheless, the overhead may be
reduced by taking the following steps:

1. Using incremental byte offsets rather than absolute byte
offsets. This results in a good amount of saving for
large sequences.

2. Using a simple text compression algorithm, like gzip on
Unix platforms or pkzip on PCs.

3. Removing the K_END_OF_DATA flag.

4. Pruning the bitstream index table to store the minimal
amount of information necessary to decode the keyframes
with minimal decoding and parsing of the bitstream.
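
Step (1) above, the use of incremental rather than absolute byte offsets, amounts to difference coding. A minimal sketch in C is given below; the function names are hypothetical, and the offsets are assumed to be stored as a plain array in non-decreasing order.

```c
#include <assert.h>
#include <stddef.h>

/* Replaces absolute byte offsets with increments, in place: the first
   value is kept and each later value becomes the difference from its
   predecessor. Processing runs from the end so that every difference
   is taken against an unmodified predecessor. */
void offsets_to_increments(long *off, size_t n)
{
    assert(off != NULL || n == 0);
    for (size_t i = n; i-- > 1; )
        off[i] -= off[i - 1];
}

/* Inverse operation: a running sum restores the absolute offsets. */
void increments_to_offsets(long *off, size_t n)
{
    assert(off != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        off[i] += off[i - 1];
}
```

Applied to the offsets 0, 150, 3000, 4200, 5300, 5400, 6200 of the earlier example, the increments are 0, 150, 2850, 1200, 1100, 100, 800. The values stay small however long the sequence is, which is where the saving for large sequences comes from.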

Note that the last item will allow access to only the
keyframe locations and does not let the user change the
locations of the keyframes later on. It may also be noted that
(1) and (4) are not completely compatible. If incremental
byte offsets are to be used, the pruning algorithm of (4)
should change the byte offsets to reflect the changed order of
frames. In the current implementation, a Pruned Bitstream
Index Table is generated, as described below. The decoder
manager subsequently uses this pruned version of the table.
The decision to use a pruned bitstream table is a function of
the amount of storage space available and the speed that is to
be obtained: if storage space is available, there is no need to
prune the bitstream table, as there will be room for the full
table, and retrieval and manipulation will be much quicker.

Pruned Bitstream Index Table 

The same structure is used as for the bitstream index table
(BIT) but with a different organizational syntax to develop
the Pruned bitstream index table (PBIT). Each keyframe is
represented as a unit (between two K_END_OF_DATA
flags) as opposed to each video frame being represented as
a unit in the BIT. The following information is necessary in
order to decode a particular (current) keyframe without
parsing and decoding the entire bitstream:

1. The type of current keyframe I/B/P.

2. The Start Decode Frame (sdf) corresponding to the
current keyframe (byte offset). Note that the actual sdf
in terms of byte offset (not ldf+1, if sdf<ldf+1) needs to
be stored because the decoder might not be operating
sequentially to decode all the keyframes, as was
assumed in the BIT version of the decoder
manager. In the case that the decoder is operating
sequentially and sdf<ldf+1, then decoding needs to
start from ldf+1. This information is already available,
as ldf+1 is the video frame following the previous
keyframe, whose offset is available.

3. Byte offset of the current keyframe.

4. The most recent sequence header offset. There is no
need for the K_SEQUENCE_HEADER flag, as every
valid MPEG-2 bitstream has a sequence header.

5. If there was ever any quantization matrix reset, the
quantization matrix offset needs to be stored, with the
K_QUANT_MATRIX_EXTENSION flag, because
there may not be any quantization matrix reset in a
valid MPEG-2 bitstream. Note that quantization matrix
resets need not be stored if the reset occurs before the
sequence header, since the sequence header's
appearance automatically resets the quantization matrix.

The decoder manager uses the sdf information, the type of
the current keyframe and its byte offset as follows: if the
desired current keyframe is of type I or P, the decoder
manager will start decoding at sdf, and will parse the
bitstream and look only for I and P frames. Such I and P
frames will be decoded until the current desired keyframe is
reached, which is also decoded. In this technique, the
decoder manager does not have to check to see if any frame
is a B frame and thus looks for only I and P frame headers.
If the desired current keyframe is of type B, the decoder
manager will consider each frame starting from sdf, will
decode all I or P frames, and stop at every B frame and check
to see if that frame is the desired keyframe.

If the keyframe is of type B, one may want to approximate
it with its most recently decoded reference (I or P) frame, in
order to eliminate the need for parsing B frames.

Suppose that a B frame is the keyframe to be decoded,
which starts at a 53500 byte offset, needs a quantization
matrix to be read from 43000 bytes and a sequence header
to be read at 39000 bytes. The frame to start decoding begins
at 45000 bytes. This data is encoded thus:

K_END_OF_DATA K_PICTURE_BFRAME K_OFFSET+45000
K_OFFSET+53500 K_OFFSET+39000
K_QUANT_MATRIX_EXTENSION K_OFFSET+43000 K_END_OF_DATA
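
The PBIT unit for this keyframe may be written out numerically with a few lines of C. The sketch below assumes the flag values of the enum given earlier; the `Keyframe` record, the convention of using -1 for "no quantization matrix reset to store", and the function `pbit_encode_unit` are hypothetical and serve only to illustrate the field order of the example above.

```c
#include <assert.h>
#include <stddef.h>

enum IndexFileState {
    K_SEQUENCE_HEADER = 0,
    K_PICTURE_IFRAME,
    K_PICTURE_BFRAME,
    K_PICTURE_PFRAME,
    K_QUANT_MATRIX_EXTENSION,
    K_END_OF_DATA,
    K_OFFSET
};

/* Hypothetical description of one keyframe for the pruned table. */
typedef struct {
    enum IndexFileState type;   /* K_PICTURE_IFRAME, _BFRAME or _PFRAME */
    long sdf_offset;            /* byte offset of the start decode frame */
    long key_offset;            /* byte offset of the keyframe itself */
    long seq_header_offset;     /* most recent sequence header */
    long quant_offset;          /* quant matrix reset, or -1 if none */
} Keyframe;

/* Writes one PBIT unit into out, in the field order of the example:
   type, sdf, keyframe offset, sequence header offset, optional quant
   matrix entry, closing delimiter. The leading delimiter is written
   only when leading_delim is non-zero (i.e. for the first unit).
   Returns the number of values written, or 0 if out is too small. */
size_t pbit_encode_unit(const Keyframe *kf, int leading_delim,
                        long *out, size_t cap)
{
    size_t w = 0;
    assert(kf != NULL && out != NULL);
    if (cap < 8)                /* largest possible unit */
        return 0;
    if (leading_delim)
        out[w++] = K_END_OF_DATA;
    out[w++] = kf->type;
    out[w++] = kf->sdf_offset + K_OFFSET;
    out[w++] = kf->key_offset + K_OFFSET;
    out[w++] = kf->seq_header_offset + K_OFFSET;
    if (kf->quant_offset >= 0) {
        out[w++] = K_QUANT_MATRIX_EXTENSION;
        out[w++] = kf->quant_offset + K_OFFSET;
    }
    out[w++] = K_END_OF_DATA;
    return w;
}
```

For the B keyframe of the example (sdf at 45000, keyframe at 53500, sequence header at 39000, quantization matrix at 43000) this yields the eight values 5 2 45006 53506 39006 4 43006 5.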

It should be clear that PBIT may be further slimmed down 
at the cost of increasing computational time by increasing 
the amount of parsing that the decoder manager performs. 
Hence, there is a tradeoff between complexity of decoder 
manager and the size of PBIT. An appropriate balance may 
be made depending on application requirements. For 
instance, the PBIT may store the sdf and the byte-offsets of 
all I and P frames between the sdf and the current keyframe 
and the byte offset of the current frame, eliminating the need 
for the decoder manager to parse the bitstream. The 
consequence, however, is an increase in size for PBIT. At the
other extreme, only the sdf and the byte offset of the
keyframe are stored, resulting in the most compact
representation for PBIT but requiring that the decoder manager
parse the bitstream between the start frame and the keyframe
positions and decode the I and P frames.

In order to incorporate automatic pan/zoom detect/extract 
functionality, the entire frame bitstream may need to be 
decoded. 

Thus a system for reviewing keyframes of a digital video
sequence has been disclosed. The input video stream may be
conventional digital video, or may be a DCT-based
compressed stream. Although a preferred embodiment of the
invention and several variations thereto have been
disclosed, it should be appreciated that further variations and
modifications may be made thereto without departing from
the scope of the invention as defined in the appended claims.

I claim: 

1. A method of hierarchical digital video summarization 
and browsing comprising: 

inputting a digital video signal for a digital video 
sequence; and 

generating a hierarchical keyframe summary, including 
dividing the hierarchical keyframe summary into mul- 
tiple level summaries, including a most compact level 
summary, a coarse level summary, and a finest level 
summary, and 

identifying keyframes by setting k_1=1, where t_0=0 and the
second frame is chosen as a candidate for being the first
keyframe; defining, for i=1 through K-1, and t_i=2k_i-t_{i-1},
k_{i+1} to be the first video frame for which
2C(t_i)-C(k_i)≤C(k_{i+1}) holds; and for i=K, computing
t'_K=2k_K-t_{K-1}, and unless t_K>2k_K-t_{K-1}=t'_K, keeping the
results of the previous iteration, add an offset to all k_i's
so that t_K=t'_K, and stopping, otherwise, increment k_1 by
1 and go to said defining; and

identifying keyframes including starting from the second 
keyframe positioned at by the largest consecutive 
difference criteria. 

2. The method of claim 1 which includes, after said 
inputting, computing histograms for the digital video 
sequence; detecting shot boundaries within the digital video 
sequence; determining the number of keyframes allocated 
within each shot; and pruning keyframes for a shot without 
meaningful action. 

3. The method of claim 2 which includes, after said 
generating, browsing the keyframes using the hierarchical 
keyframe summary. 
4. The method of claim 2 which, after said inputting,
includes detecting and removing dissolve events.

5. The method of claim 4 which, after said detecting and
removing dissolve events, includes detecting global motion
events by detecting frames within the digital video sequence
that include events taken from the group of events consisting
of pan events and zoom events.

6. The method of claim 5 which includes detecting pan 
events and building an image mosaic. 

7. The method of claim 5 which includes detecting zoom
events, estimating the degree of zoom in the event, and 
compiling a zoom summary. 

8. The method of claim 5 which includes excluding global 
motion events from the hierarchical summarization process. 

9. The method of claim 1 which includes browsing the
keyframes by a user after selecting a specific level summary. 

10. The method of claim 1 wherein keyframes in the 
keyframe hierarchical summary may be spatially sub- 
sampled into thumbnails for storage, retrieval or display. 

11. The method of claim 1 wherein said generating a
hierarchical keyframe summary includes clustering key- 
frames and generating keyframes of a coarser level sum- 
mary. 

12. The method of claim 11 wherein said clustering 
includes producing a compaction ratio in the number of
keyframes at the coarser level. 

13. The method of claim 11 wherein said clustering 
includes pairwise clustering. 

14. The method of claim 11 wherein said generating
keyframes of a coarser level summary includes generating
keyframes using largest consecutive difference criteria. 

15. The method of claim 1 wherein said computing 
includes locating the last keyframe of the shot adjacent the 
midpoint between t_{K-1} and t_K.

16. The method of claim 1 wherein said identifying
includes selecting every (n/K)th frame as a keyframe. 

17. The method of claim 1 wherein said identifying 
includes detecting uninteresting shots and eliminating their 
keyframes from the hierarchical keyframe summary. 

18. The method of claim 1 wherein said inputting includes
inputting a compressed digital video sequence and generat- 
ing a bitstream index table, wherein said computing histo- 
grams includes only partially decoding the compressed 
digital video sequence. 
19. The method of claim 18 wherein said allocating
keyframes within each shot includes fully decoding the 
keyframe. 

20. The method of claim 19 wherein said fully decoding 
the keyframe includes decoding the keyframe without pars- 
ing the video bitstream and without completely decoding the 
video bitstream by using a bitstream index table. 

21. The method of claim 18 wherein said partially decod- 
ing a DCT-based compressed video includes using the DC 
value of DCT coefficients to compute a histogram. 

22. The method of claim 18 wherein said partially decod- 
ing includes decoding only keyframes and their reference 
frames. 

23. The method of claim 18 wherein said decoding 
includes decoding by a decoder manager. 

24. The method of claim 23 wherein said decoding by a 
decoder manager includes using a bitstream index table to 
decode the keyframes with minimal decoding and parsing of 
the entire video bitstream. 

25. The method of claim 24 wherein said decoding by the 
decoder manager includes generating a pruned bitstream 
index table and storing only the information needed to 
decode keyframes without parsing and decoding the entire 
bitstream. 

26. The method of claim 1 which further includes gener- 
ating one or more coarser-level summaries from a given 
keyframe summary by statistical clustering histogram vec- 
tors of keyframes. 

27. The method of claim 26 where only those keyframes 
that are consecutive in time are allowed to be included in the 
same cluster. 

28. The method of claim 27 where clustering is performed 
using a pairwise K-means clustering algorithm. 

29. The method of claim 28 which includes selecting 
keyframes to represent keyframe clusters and choosing 
keyframes as those frames within the clusters whose histo- 
gram vectors are closest to the centroid vectors of the 
clusters. 

30. The method of claim 28 wherein said selecting 
includes selecting the keyframes in the second and existing 
subsequent clusters on the basis of largest consecutive 
difference criterion. 

***** 